Preprocessing

There are several steps followed for preprocessing the data.

First, the dataset is completed. If the dataframe extracted from MIO is incomplete, i.e., contains only rows with non-zero sales and no rows with 0 sales, thus having a lot of missing dates, the missing rows are added to the dataframe for each brand.

All dates from a certain starting point (usually 2018-01-01) to the end of desired forecasting period (usually 2025-02-01) are added.
```
import pandas as pd

all_dates = pd.date_range(start="2018-01-01", end="2025-02-01", freq="D")
all_brand_names = df["brand"].unique()

all_combinations = pd.DataFrame(
    [(date, brand)
    for date in all_dates
    for brand in all_brand_names],
    columns=["date", "brand"]
)

df = pd.merge(all_combinations, df, on=["date", "brand"], how='left')    
df.fillna(0, inplace=True)
df = df[df["brand"] != 0]
```
Missing values (for quantity delivered) are filled with 0 and if any brand is originally NaN and has been filled with 0 too, it is removed.

Next, the data is aggregated to monthly.

df["date"] = df["date"].dt.to_period("M").dt.to_timestamp()
df = df.groupby(["brand", "date"])["quantity_delivered"].sum().reset_index()

product_values = df[(df["date"] >= "2023-01-01") & (df["date"] <= "2023-12-01")].groupby("brand")["quantity_delivered"].sum()
discontinued_products = product_values.loc[product_values.values < 1]
df = df[~df["brand"].isin(discontinued_products.index)]
df.reset_index(drop=True, inplace=True)

Furthermore, brands which have been discontinued in the region (no sales in 2023) are removed.

Weather data is merged with the dataframe optionally.

weather_df["date"] = pd.to_datetime(weather_df["date"])
weather_df["date"] = weather_df["date"].dt.to_period("M").dt.to_timestamp()
weather_df = weather_df.groupby(["date"]).mean().reset_index()

weather_df_date_max = weather_df["date"].max()
previous_year_start = (weather_df_date_max - relativedelta(years=1)).strftime('%Y-%m-%d')
previous_year_end = (datetime.strptime("2025-02-01", '%Y-%m-%d') - relativedelta(years=1)).strftime('%Y-%m-%d')

weather_df_features = list(weather_df.columns)
weather_df_features.remove("date")

df = pd.merge(df, weather_df, on="date", how='left')

while df[df["date"] > weather_df_date_max][weather_df_features].isna().sum().sum() > 0:
    df.loc[df["date"] > weather_df_date_max, weather_df_features] = df.loc[(df["date"] > previous_year_start) & (df["date"] <= previous_year_end), weather_df_features].values

df.fillna(0, inplace=True)

The weather data for future time periods is filled with weather data of the previous year.